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Abstract. Nowadays, information management systems deal with data 
originating from different sources including relational databases, NoSQL 
data stores, and Web data formats, varying not only in terms of data 
formats, but also in the underlying data model. Integrating data from 
heterogeneous data sources is a time-consuming and error-prone engi- 
neering task; part of this process requires that the data has to be trans- 
formed from its original form to other forms, repeating all along the life 
cycle. With this report we provide a principled overview on the funda- 
mental data shapes tabular, tree, and graph as well as transformations 
between them, in order to gain a better understanding for performing 
said transformations more efficiently and effectively. 



1 Motivation 

These days, content and information management systems have to deal with 
data originating from an array of sources, such as relational databases, NoSQL 
data stores, and Web data formats. The data sources vary not only in terms 
of data formats, but first and foremost in the underlying data model, be it 
implicit — such as with JSON — or explicit, think: RDF. 

As recently put forward by Helland [Helll], data integration of heterogeneous 
data sources is a time-consuming, costly, and error-prone engineering task. Typ- 
ically, the data has to be transformed from its original form to other forms, 
repeating all along the life cycle. For example, let us assume we want to pub- 
lish data from an government agency such as spreadsheets containing statistical 
information into the Linked Open Data cloud 3 . One task of the Linked Data 
life cycles would then be to transform the original tabular spreadsheet data into 
(graph-shaped) RDF. Once we have the data transformed into RDF according 
to, say, the RDF Data Cube Vocabulary 4 , we want to visualize it in an appealing 
way, so we decide to use the Google Charts API 5 requiring us to provide input 
as tabular data. So, again we have to transform a graph into a tabular, and then 
the application visualize the information in an appealing way. 



' http://lod-cloud.net 
http: //www . w3 . org/TR/vocab- data- cube/ 
https : //developers . google . com/chart/ 



Apparently, even in this toy example, we have to transform data from its 
original form to potentially many (intermediate) other forms. This was the mo- 
tivation for us to compile this report, aiming to provide a principled overview 
of possible fundamental data shapes and transformations between them. In the 
following we will focus on the data transformation within the context of the 
"Extract, Transform and Load" process, a valuable process, growing in its use 
and scale. We use the term data shape to refer to how the data is arranged and 
structured, closely related to the term data model 6 : we have identified three fun- 
damental data shapes: tabular, tree, and graph and respective transformations 
between them. 

The remainder of the report is organized as follows: in Section 2 we intro- 
duce and motivate the fundamental data shapes, then in Section 3 we describe 
transformations between data shapes, and, finally, in Section 4 we discuss open 
issues and challenges concerning the transformations. 

2 Fundamental data shapes 

In the following we motivate and introduce the three fundamental data shapes 
tabular, tree, and graph, derived from data structures 7 and as found in the wild in 
various datas sources, including but not limited to relational databases (RDB), 
NoSQL data stores [Catll], or Web data formats such as JSON, OData and 
RDF serialisations. 



2.1 Tabular 

A tabular data shape organizes data items into a table. A table is a set of data 
elements (values) that are organized using a model of vertical columns (identified 
by their name), and horizontal rows. A table has a specified number of columns. 
Examples of tabular data shapes are: 

- CSV (Comma Separated Values) files as of RFC 4180 s — These files are used 
to store tabular data, capable of storing numbers as well as text in a plain- 
text format that can be easily written and read by humans and software 
alike. 

— RDB (relational databases) — A relational database is essentially a group of 
tables (entities). Tables are made up of columns and rows (tuples). Those 
tables have constraints, and relationships are defined between them. Rela- 
tional databases are queried using SQL, and result sets are produced from 
queries that access data from one or more tables. 



http : //en . wikipedia . org/ wiki/Structured_data 

7 http : / /en . wikibooks . org/ wiki/Data_Structures 

8 http : //tools . ietf . org/html/rf c4180 



2.2 Tree 



A tree is a non-empty set, one element of which is designated the root of the tree 
while the remaining elements are partitioned into non-empty sets each of which 
is a subtree of the root. Tree nodes have many useful properties. The depth of 
a node is the length of the path (or the number of edges) from the root to that 
node. The height of a node is the longest path from that node to its leaves. The 
height of a tree is the height of the root. A leaf node has no children, its only 
path is up to its parent. 

A particular case of a tree is the key-value data shape — a linked list of key- 
value pairs. Examples of tree data shapes are: 

— XML (extensible Markup Language) — An open and flexible format used to 
exchange a wide variety of data on and off the Web. XML is a tree structure 
of nodes and nested nodes of information where the user defines the names 
of the nodes 9 . 

— JSON (JavaScript Object Notation) — A lightweight data-interchange for- 
mat. It is easy for humans to read and write as well as straightforward for 
machines to parse and generate 10 . 

— YAML (YAML Ain't Markup Language)— A Super-set of JSON and general- 
purpos data serialization language designed to be human-friendly and work 
well with modern programming languages for common everyday tasks 11 . 



2.3 Graph 

A graph is a mathematical structure consisting of a set of vertexes (also called 
nodes), and a set of edges. An edge is a pair of vertexes. The two vertexes are 
called edge endpoints. A graph may be either undirected or directed. Intuitively, 
an undirected edge models a "two-way" or "duplex" connection between its 
endpoints, while a directed edge is a "one-way" connection, and is typically 
represented by an arrow. 
Examples of graphs are: 

— RDF (Resource Description Framework) — A family of World Wide Web 
Consortium (W3C) specifications originally designed as a metadata model. 
It has come to be used as a general method for conceptual description or 
modeling of information that is implemented in web resources, using a variety 
of syntax formats 12 . 

— Topic Maps — Topic maps are an ISO standard for describing knowledge 
structures and associating them with information resources 13 . 

9 http : //www . w3 . org/XML/ 

10 http://json.org/ 

11 http://yaml.org/ 

12 http://www.w3.org/RDF/ 

13 http://www.isotopicmaps.org/ 



3 Data shapes transformations 



In this section we compare the possible transformations we can perform between 
two given data shapes. To this end, we have identified a set of features along 
three dimensions — the input, the output, and the transformation process — and 
provide motivational usage scenarios per transformation. We acknowledge that 
the characterisations and the formats presented in following are neither exhaus- 
tive nor complete, however, serve as a useful starting point. 

— Dimension 1 — concerning the input data shape: 

• The generic data shape, e.g., tabular, tree or graph. 

• The specific implementation of the data shape, e.g., XML, JSON, rela- 
tional database, ect. 

— Dimension 2 — concerning the output data shape: 

• The generic data shape, e.g., tabular, tree or graph. 

• The specific implementation of the data shape, e.g., XML, JSON, rela- 
tional database, ect. 

— Dimension 3 — concerning the transformation process: 

• The transformation process can be declarative or operational. 

* Declarative. There is a transformation description, the transforma- 
tion is based on a language that describes the mappings between the 
input and output shapes. 

* Operational. The transformation is only based on an ad-hoc trans- 
formation engine. 

• The transformation process can have an information loss (also known as 
lossy transformation) defined by: "all queries that are possible on the 
original shape are also possible on the resultant shape" . We have infor- 
mation loss when we change the abstraction level; this happens typically, 
when we transform a "richer" shape into a "less rich shape", e.g., from 
graph to tabular. 

The Table 3 illustrates all possible transformations between two given data 
shapes and provides pointers to the respective subsections where we discuss them 
in further detail. 



Table 3.1. Data shapes transformations overview. 



from/to 


tree 


tabular 


graph 


tree 


cf. Section 3.1 


cf. Section 3.2 


cf. Section 3.3 


tabular 


cf. Section 3.4 


cf. Section 3.5 


cf. Section 3.6 


graph 


cf. Section 3.7 


cf. Section 3.8 


cf. Section 3.9 



3.1 Tree Tree 



In this case, the transformation takes as input a given tree and outputs another 
tree. Let us suppose we have a set of XML documents that contain the description 
of the transactions of a company, and we need to submit these tin JSON files 
instead of XML, so we need to perform a transformation from tree to tree. 
Examples of these transformations are: 

- XML to XML. An XSLT that turns a DocBook 14 file into XHTML. 

- XML to JSON. A program that turns a XML file into JSON, or, for example 
via XSLT. 

3.2 Tree Tabular 

This transformation takes as input a tree and outputs a tabular. Let us suppose 
we have a set of XML that contain the description of the transactions of a com- 
pany, and we need to submit these to an entity such as a government agency that 
works with CSV files instead of XML, so we need to perform a transformation 
from tree to tabular. Examples of these transformations are: 

- XML to RDB: 

• In [FH09] a technique is described to transform XML into a RDB. The 
thecnique relies on the XSD of the XML. 

• The connect xml-2-db tool 15 relies on mapping files. 

- XML to CSV: 

• XSLT 16 . 

• Scripts 17 . 

3.3 Tree Graph 

This transformation takes as input a tree and outputs a graph. Let us suppose 
we have a set of XML document that contain the description of the transactions 
of a company, and we need to submit these to an entity such as a government 
agency that works with RDF for integration purposes. In this setup, we need to 
transform from tree to graph. Examples of these transformations are: 

- For example, with Gleaning Resource Descriptions from Dialects of Lan- 
guages (GRDDL) 18 one can turn an OData document 19 file into a corre- 
sponding RDF representation. 

- Rhizomik ReDeFer 20 that includes XSD20WL, XML2RDF. 

- XSPARQL 21 is a query language combining XQuery and SPARQL for trans- 
formations between RDF and XML. 

14 http://www.docbook.org/ 

15 http : / /www . skyhawksystems . com/users_guide/runningxml2db . htm 

16 http://www.w3.org/TR/xslt 

17 For example, http://www.ricebridge.com/xml-csv-convert.htm 

18 http : //www . w3 . org/TR/grddl/ 

19 http://www.odata.org/ 

20 http://rhizomik.net/redefer/ 

21 http : / /www . w3 . org/Submission/ xsparql- language- specification 



3.4 Tabular Tree 



This transformation takes as input a tabular shape and outputs a tree shape. 
Let us suppose we have a relational database containing transaction data of a 
company, and we need to submit these transactions that requires, for integra- 
tion purposes, the data in XML, so we need to transform from tabular to tree. 
Examples of these transformations are not standardised, but there are bespoke 
systems such as: 

- XML representation of a relational database 22 . 

- XMLSpy Relational Database Integration 23 . 

- CSV-to-XML 24 . 

3.5 Tabular Tabular 

This transformation takes as input a tabular and outputs a tabular. Let us 
suppose we have a relational database that contain the description of the trans- 
actions in our company. We need to display these transactions in the company 
web page. To this end, we have to transform from a tabular (RDB) to a tabular 
(web page). Examples of these transformations are: 

- RDB to RDB: SQL SELECT. 

- CSV to RDB: relying on a particular DBMS import tool. 

3.6 Tabular Graph 

This transformation takes as input a tabular and outputs a graph. Let us suppose 
we have a relational database that contain the description of the transactions in 
our company, and we need to submit these transactions into the central office in 
London. For integration purposes the central office is using RDF, so we need to 
transform from tabular to graph. Examples of these transformations are: 

- RDB to RDF: 

• W3C's RDB2RDF activity 25 : Direct Mapping and R2RML, a language 
for expressing customized mappings from relational databases to RDF 
datasets. 

- CSV to RDF: 

• XLWrap - language 

• TopBraid - tool 

• RDF extension of Google Refine - tool - GUI 

- RDB to Topic maps [NP09]. 

22 http : //www . w3 . org/XML/RDB . html 

2,5 http: //www. altova. com/xmlspy/database-xml .html 

24 http: //csv2xml . sourcef orge .net/ 

25 http : //www . w3 . org/2001/sw/rdb2rdf / 



3.7 Graph Tree 



This transformation takes as input a graph and outputs a tree. Let us suppose we 
have an RDF dataset for representing the statistical information of a company 
and we need to transfer this information to an XML-based format such as PC- 
Axis, used by the Irish CSO. Examples of these transformations are: 

- Turning RDF to XML 26 . 

— XSPARQL 27 is a query language combining XQuery and SPARQL for trans- 
formations between RDF and XML. 

— Geo2KML 28 web service which converts RDF to KML suitable for showing 
on Google Earth & Maps. 

3.8 Graph Tabular 

The transformation takes as input a graph and outputs a tabular. Let us suppose 
we have an RDF dataset for representing the statistical information of a com- 
pany and want to use Google Charts for visualising it. This requires a tabular 
representation in CSV and therefore we have to perform a transformation from 
graph to tabular. Examples of these transformations are: 

- SPARQL SELECT 

— ad-hoc conversion scripts. 

3.9 Graph Graph 

This transformation takes as input a graph and outputs a graph. Let us suppose 
we have an RDF dataset for representing the statistical information of a com- 
pany, expressed in the W3C RDF Data Cube vocabulary 29 . Now, further assume 
that someone is still using the deprecated SCOVO 30 vocabulary for representing 
the statistical information. Therefore we need to transform our data expressed in 
RDF Data Cube to SCOVO. In this case, we have to perform a transformation 
from a graph to graph. Examples of these transformations are: 

- RDF to RDF: 

• SPARQL CONSTRUCT. 

• R2R 31 . 

- JSON to RDF: JSON to RDF web service 32 

— RTM 33 is a vocabulary that can be used to describe the mapping of an RDF 
vocabulary to topic maps in such a way that RDF data using that vocabulary 
can be converted automatically to topic maps. 

//www. w3 . org/wiki/ConverterFromRdf 

//www. w3 . org/Submission/xsparql- language- specification 
/ / graphite . ecs . soton . ac . uk/ geo2kml/ 
//www. w3 . org/TR/vocab- data- cube/ 
/ /purl . org/NET/ scovo# 
//www4 . wiwiss . fu-berlin . de/bizer/r2r/ 
/ /graphite . ecs . soton . ac . uk/ rdf 2 j son/ 
/ / www . ontopia . net /t op icmaps /mater ials/rdf 2tm . html 



3.10 Summary 



In Table 3.2 we provide a summary of the data shapes transformation and their 
characteristics. 



Table 3.2. Data shapes transformations comparison. 



Input 


Output 


Nature 


Lossy? 


Standard 


Tabular (RDB) 


Tabular (RDB) 


Declarative 


No 


SQL 


Tabular (RDB) 


Tree (XML) 


Operational 


No 


No 


Tabular (RDB) 


Graph (RDF) 


Declarative 


No 


RDB2RDF 


Tree (XML) 


Tabular (RDB) 


Operational 


No 


No 


Tree (XML) 


Tree (XML) 


Declarative 


No 


XSLT 


Tree (XML) 


Tree (XML) 


Declarative 


No 


XSLT 


Graph (RDF) 


Tabular (RDB) 


Declarative 


Yes 


SPARQL SELECT 


Graph (RDF) 


Tree (XML) 


Declarative 


Yes 


No 


Graph (RDF) 


Graph (RDF) 


Declarative 


No 


SPARQL CONSTRUCT 



4 Discussion 

Motivated by our experiences gathered in data integration projects as well as 
in standardisation activities within W3C we wanted to provide a principled 
overview on the fundamental data shapes and transformations between them. 
Summarising, we can state the following: 

— We can perform (loss-less) data shape transformations between certain shapes. 

— A number of data shape transformations are already standards or in the 
process of being standardised, including: 

• For RDB2RDF, see R2RML and Direct Mapping. 

• For XML2XML, see XSLT. 

• For XML2RDF, see GRDDL. 

— We found that some data shape transformations are declarative in nature 
and it would be interesting to learn if others can and should be expressed 
declaratively as well. 

— To this end, we have not taken provenance information in the transformation 
process into account. Again, this is something worthwhile to follow up on. 

— In certain cases we have to deal with lossy transformations. A more sys- 
tematic study of these cases, including an assessment of the implications 
concerning the data integration process is subject to future research. 

We hope that the report in the current form is useful for both researchers 
and practitioners alike and consider it as one contribution in helping to establish 
a discussion around data shapes and their transformations in order to advance 
the state of the art. 
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